Short Communication: Classification of Metabolites with Kernel-Partial Least Squares (K-PLS)

Abstract

Numerous experimental and computational approaches have been developed to predict human drug metabolism. Since databases of human drug metabolism information are widely available, these can be used to train computational algorithms and generate predictive approaches. In turn, they may be used to assist in the identification of possible metabolites from a large number of molecules in drug discovery based on molecular structure alone. In the current study we have used a commercially available database (MetaDrug) and extracted a fraction of the human drug metabolism data. These data were used along with augmented atom descriptors in a predictive machine learning model, kernel-partial least squares (K-PLS). A total of 317 molecules, including parent drugs and their primary and secondary (sequential) metabolites, were used to build these models corresponding to individual metabolism rules, representing the formation of discrete metabolites, e.g., N-dealkylation. Each model was internally validated to assess the capability to classify other molecules that were left out. Using receiver operator curve statistics, models for N-dealkylation, O-dealkylation, aromatic hydroxylation, aliphatic hydroxylation, O-glucuronidation, and O-sulfation gave area under the curve values from 0.75 to 0.84 and were able to predict between 61 and 79% of active molecules upon leave-one-out testing. This preliminary study indicates that K-PLS and possibly other similar machine learning methods (such as support vector machines) can be applied to predicting human drug metabolite formation in a classification manner. Improvements can be achieved using considerably larger datasets that contain more positive examples for the less frequently occurring metabolite rules, as well as the external evaluation of novel molecules.

With the emphasis now on increasing the efficiency of drug discovery, there is interest in using predictive computational approaches to complement in vitro and in vivo studies.
In the area of metabolism prediction, these techniques encompass pharmacophores (Ekins et al., 2001), quantitative structure-activity relationships (QSARs) (Shen et al., 2003; Balakin et al., 2004), electronic models (Korzekwa et al., 2004), and commercial drug metabolism databases (Borodina et al., 2004), as well as other methods that have been comprehensively reviewed elsewhere (de Graaf et al., 2005; Ekins et al., 2005a; de Groot, 2006). Some approaches have combined metabolite data and rules for suggesting metabolic pathways across multiple species (Erhardt, 2003). Such databases may also be useful for calculating the probability for a given metabolic reaction (Boyer and Zamora, 2002) to then indicate potential metabolites and the sites of metabolism using statistical or algorithmic approaches (Borodina et al., 2004). Although these types of comprehensive databases generally enable numerous search options to retrieve molecule structures and published information, the predictive capabilities seem limited at present (Wishart et al., 2006). A major limitation is that they are unlikely to have a complete dataset of reactions and molecular structures to extrapolate for a new molecule. In turn, the user is reliant on the quality of the published in vitro or in vivo data which, in many cases, may predate modern analytical methods, such that older published metabolic pathways may be incomplete. In reality, such database approaches provide knowledge of most published data and are perhaps limited to interpolation. The combination of different approaches to drug metabolite prediction may balance the strengths and weaknesses of each approach, and several commercial methods are now pursuing this direction. MetaDrug represents one such method, combining a manually annotated database of human drug metabolism information including xenobiotic reactions, enzyme substrates, and enzyme inhibitors with kinetic data (Ekins et al., 2005b, 2006). 
This database has enabled the generation of rules for predicting likely metabolic reactions. The parent molecule and metabolites may then be scored through integrated QSAR models and rules for molecule reactivity before visualizing molecules as nodes on a network diagram (Ekins et al., 2005b, 2006). Such rule-based metabolite predictions indicate that it is possible to generate many more metabolites than have been identified in the literature, which may make the methods less useful (Ekins et al., 2006). We are therefore investigating approaches to limit the metabolites to those that are most likely. Recently, a number of machine learning approaches including support vector machines and kernel-partial least squares (K-PLS) (Rosipal and Trejo, 2001) have been implemented in a single software package (Analyze/StripMiner), and this package was used with several benchmark datasets (Bennett and Embrechts, 2003) including protein binding and other physicochemical properties. The results with K-PLS indicated that it could be favorably applied to other datasets to enable QSAR model construction and aid drug discovery research.

The development of MetaDrug was supported by National Institutes of Health Grants 1-R43-GM069124-01 and 2-R44-GM069124-02 “In silico Assessment of Drug Metabolism and Toxicity”. Competing Financial Interest: MetaDrug is a proprietary tool developed and licensed by GeneGo, Inc. Article, publication date, and citation information can be found at http://dmd.aspetjournals.org. doi:10.1124/dmd.106.013185. ABBREVIATIONS: QSAR, quantitative structure-activity relationship; K-PLS, kernel-partial least squares; AUC, area under the curve. Drug Metabolism and Disposition, Vol. 35, No. 3, 325–327, 2007. Copyright © 2007 by The American Society for Pharmacology and Experimental Therapeutics.
In the current proof of concept study, we have used K-PLS to generate preliminary classification models to identify whether a metabolite is likely to be produced for a particular parent molecule.

Materials and Methods

Literature Data. Three hundred seventeen molecules were randomly extracted from the MetaDrug database (GeneGo Inc., St. Joseph, MI) (Ekins et al., 2006), and this represents a small fraction of the human drug metabolism content. These molecules were prepared as an sdf file containing data for the 65 metabolic pathways of interest (Ekins et al., 2005a) with binary data for the presence or absence of a metabolite.

Descriptor Calculation. ChemTree software (GoldenHelix, Bozeman, MT) running on a Pentium 4 processor was used to generate augmented atom molecular descriptors (Young et al., 2002) representing the presence or absence of a particular heavy atom with its immediately bonded neighbors. In total, 61 descriptors were generated for the set of molecules.

Data Preprocessing. Metabolic reactions with greater than two examples of the metabolite rule were then used for modeling; this narrowed down the dataset considerably. The matrix of molecular descriptors and biological activity data was then scaled (normalized), and variables with unchanging values were removed using feature selection with the Analyze/StripMiner software (software available from M.J.E. at http://www.rpi.edu/locker/82/001182/) (Embrechts et al., 2001). Among descriptors more than 95% correlated with one another (i.e., “cousin descriptors”), only the descriptors most correlated with the response were retained. In addition, four sigma outliers were brought within 2.5 sigma.

K-PLS Modeling Method and Testing. The Analyze software uses the K-PLS method (Rosipal and Trejo, 2001) with two key parameters, the number of latent variables and the Parzen window or Gaussian kernel sigma.
In this study, the number of latent variables was held fixed at 5, and the Gaussian kernel sigmas were tuned using a second-order Newton method in which the performance criterion was minimization of the error on the validation data using 5-fold cross-validation. The sigmas were tuned just once, using the metabolite with the most positive examples. Tuning sigma on a single metabolite is a conservative approach that prevents over-tuning. Furthermore, the fact that the model still has good predictive power on the other metabolites is another indication that over-tuning did not occur in this case. K-PLS uses kernels and can therefore be seen as a nonlinear extension of the PLS method. The commonly used radial basis function kernel or Gaussian kernel was applied, where the kernel is expressed as follows (Cristianini and Shawe-Taylor, 2000):
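In its standard textbook form, stated here as an assumption because the equation itself is not reproduced in this excerpt, the Gaussian kernel between two descriptor vectors x_i and x_j is k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)). The Python sketch below illustrates that kernel together with the preprocessing steps described under Data Preprocessing. It is not the Analyze/StripMiner implementation: the function names, the greedy rule used to choose among "cousin descriptors", and the toy data are all hypothetical, and only the thresholds (95% correlation, 4 sigma, 2.5 sigma) come from the text.

```python
import numpy as np

def preprocess_descriptors(X, y, corr_threshold=0.95,
                           outlier_sigma=4.0, clip_sigma=2.5):
    """Hypothetical helper; assumes y is not constant."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # 1. Drop descriptors whose value never changes.
    varying = np.flatnonzero(X.std(axis=0) > 0)
    Xv = X[:, varying]
    # 2. Scale each descriptor to zero mean and unit variance.
    Z = (Xv - Xv.mean(axis=0)) / Xv.std(axis=0)
    # 3. Cousin descriptors: visit columns from most to least correlated
    #    with the response and keep one only if it is not more than
    #    corr_threshold correlated with a column already kept
    #    (an assumed greedy reading of the rule in the text).
    n = Z.shape[0]
    zy = (y - y.mean()) / y.std()
    corr_with_y = np.abs(Z.T @ zy) / n
    corr_between = np.abs(Z.T @ Z) / n
    selected = []
    for j in np.argsort(-corr_with_y, kind="stable"):
        if all(corr_between[j, k] <= corr_threshold for k in selected):
            selected.append(j)
    selected = sorted(selected)
    Z = Z[:, selected]
    # 4. Bring values beyond 4 sigma back to +/- 2.5 sigma.
    Z = np.where(np.abs(Z) > outlier_sigma, np.sign(Z) * clip_sigma, Z)
    return Z, varying[selected]

def gaussian_kernel(A, B, sigma):
    """Matrix of exp(-||a - b||^2 / (2 sigma^2)) for rows a of A, b of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    sq_dist = ((A ** 2).sum(axis=1)[:, None]
               + (B ** 2).sum(axis=1)[None, :]
               - 2.0 * A @ B.T)
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against rounding below zero
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Toy data (invented for illustration): a constant column, two perfectly
# correlated columns, an alternating column, and a column with one outlier.
a = np.arange(30.0)
spike = np.zeros(30)
spike[0] = 100.0
X_demo = np.column_stack([
    np.ones(30),
    a,
    2.0 * a + 1.0,
    np.where(np.arange(30) % 2 == 0, 1.0, -1.0),
    spike,
])
y_demo = (a > 14.5).astype(float)

Z_demo, kept_demo = preprocess_descriptors(X_demo, y_demo)
K_demo = gaussian_kernel(Z_demo, Z_demo, sigma=1.0)
```

In this toy run the constant column and one of the two perfectly correlated columns are dropped, and the single extreme value is pulled back to 2.5. A smaller sigma makes the kernel more local, which is why the text treats sigma, alongside the number of latent variables, as a key parameter to tune.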
